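2022 iThome Ironman Contest

DAY 17

Self-Challenge Group

Development Notes on a News Opinion Extraction Application Based on Natural Language Processing, Part 17

[Day-17] Getting Opinion Extraction Results Interactively with spacy-streamlit

14th Ironman Contest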

jameshuang

2022-10-02 23:54:43


Day-17 Contents

Getting opinion extraction results interactively with spacy-streamlit
- Final result
- How to use spacy-streamlit
  - Installation
  - Writing the streamlit code
  - Running streamlit

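Getting opinion extraction results interactively with spacy-streamlit

spacy-streamlit is an open-source project that lets streamlit display displaCy visualizations.

Final result

First, here is a screenshot showing what today's topic looks like in practice.

For the current use case, streamlit offers the following advantages:

- Typing text into the "Text to analyze" input box produces visualizations of the spans and the dependency parse, so the page can serve as a lookup tool alongside Label Studio.
- Different spaCy span annotations can be shown on the same page, which makes it possible to compare several opinion extraction methods side by side and refine them.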

How to use spacy-streamlit

Installation

pip install spacy-streamlit

Writing the streamlit code

First, import the required packages and load the model.

import stanza
import spacy_stanza
from ckip_transformers.nlp import CkipPosTagger, CkipNerChunker
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Span

stanza.download("zh-hant")
nlp = spacy_stanza.load_pipeline("xx", lang='zh-hant')

Add the code that produces the annotations

The key piece of this code is generate_doc(text: str), which takes a news paragraph (text) as input and returns a fully processed spaCy doc.

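def add_ner(doc):
    # Run CKIP NER on the raw text and map each entity back onto the spaCy doc.
    ner_driver = CkipNerChunker(model="bert-base")
    ner = ner_driver([str(doc)], show_progress=False)
    ner_spans = []
    for entity in ner[0]:
        # char_span returns None when the offsets do not align with token boundaries.
        span = doc.char_span(entity.idx[0], entity.idx[1], label=entity.ner)
        if span is None:
            # Retry with the end offset extended by one character.
            span = doc.char_span(entity.idx[0], entity.idx[1] + 1, label=entity.ner)
        if span is not None:
            # Skip entities that still cannot be aligned; doc.ents rejects None.
            ner_spans.append(span)
    # Append the CKIP entities to any entities the pipeline already set.
    doc.ents = list(doc.ents) + ner_spans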
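The fallback in add_ner depends on how Doc.char_span treats character offsets that do not fall on token boundaries. The sketch below illustrates that behavior on a hand-built Doc. The words and the PERSON label are made up for illustration, and no CKIP or Stanza model is loaded. It also shows alignment_mode="expand", a spaCy option that might serve as an alternative to retrying with the end offset plus one; that is a suggestion, not what the code above does.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-built doc (hypothetical tokenization); the text is "唐鳳表示監理業務".
words = ["唐鳳", "表示", "監理", "業務"]
doc = Doc(Vocab(), words=words, spaces=[False] * len(words))

# Offsets 0..2 cover exactly the first token, so a span is returned.
aligned = doc.char_span(0, 2, label="PERSON")

# Offsets 0..1 end in the middle of the first token: strict mode returns None.
misaligned = doc.char_span(0, 1, label="PERSON")

# "expand" grows the span to the smallest token range covering the offsets.
expanded = doc.char_span(0, 1, label="PERSON", alignment_mode="expand")

print(aligned.text, misaligned, expanded.text)
```

Whether expanding is appropriate depends on how often the tokenizer and the NER model disagree on boundaries, so the None check remains useful either way.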

def add_ckip_tag(doc):
    pos_driver = CkipPosTagger(model="bert-base")
    words = [[str(token) for token in doc]]
    pos = pos_driver(words, show_progress=False)
    for token, ckip_pos in zip(doc, pos[0]):
        token.tag_ = ckip_pos

pattern = [
  {
    "RIGHT_ID": "VE",
    "RIGHT_ATTRS": {"TAG":  "VE"}
  },
  {
    "LEFT_ID": "VE",
    "REL_OP": ">",
    "RIGHT_ID": "who_root",
    "RIGHT_ATTRS": {"DEP": "nsubj"}
  },
  {
    "LEFT_ID": "VE",
    "REL_OP": ">",
    "RIGHT_ID": "idea_root",
    "RIGHT_ATTRS": {"DEP": {"IN": ["ccomp", "parataxis"]}}
  }
]

version = "v0"

matcher = DependencyMatcher(nlp.vocab, validate=True)
matcher.add(f"{version}", [pattern])
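To see what this pattern matches without loading Stanza or CKIP, the following sketch runs the same pattern on a small Doc whose heads, dependency labels, and CKIP-style tags are assigned by hand. The sentence and all of its annotations are assumptions made for illustration; a real parser may analyze it differently.

```python
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()
# Hypothetical, hand-annotated sentence: "唐鳳表示業務不屬於數位部".
words = ["唐鳳", "表示", "業務", "不", "屬於", "數位部"]
heads = [1, 1, 4, 4, 1, 4]  # absolute index of each token's head
deps = ["nsubj", "ROOT", "nsubj", "advmod", "ccomp", "obj"]
tags = ["Nb", "VE", "Na", "D", "VG", "Nc"]  # CKIP-style tags, assigned by hand
doc = Doc(vocab, words=words, spaces=[False] * len(words),
          heads=heads, deps=deps, tags=tags)

# Same structure as the pattern above: a VE verb with a subject and a clause.
pattern = [
    {"RIGHT_ID": "VE", "RIGHT_ATTRS": {"TAG": "VE"}},
    {"LEFT_ID": "VE", "REL_OP": ">", "RIGHT_ID": "who_root",
     "RIGHT_ATTRS": {"DEP": "nsubj"}},
    {"LEFT_ID": "VE", "REL_OP": ">", "RIGHT_ID": "idea_root",
     "RIGHT_ATTRS": {"DEP": {"IN": ["ccomp", "parataxis"]}}},
]
matcher = DependencyMatcher(vocab)
matcher.add("v0", [pattern])

# Each match is (match_id, token_ids), with token_ids in pattern order.
matches = matcher(doc)
_, (ve_i, who_i, idea_i) = matches[0]

# Expand each root token to its full subtree, as generate_doc does.
who = doc[doc[who_i].left_edge.i : doc[who_i].right_edge.i + 1]
opinion = doc[doc[idea_i].left_edge.i : doc[idea_i].right_edge.i + 1]
print(doc[ve_i].text, "|", who.text, "|", opinion.text)
```

Note that the second nsubj (業務) is not matched as who_root, because the ">" operator only accepts direct children of the VE token.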

def generate_doc(text: str):
    doc = nlp(text)
    add_ner(doc)
    add_ckip_tag(doc)

    matches = matcher(doc)
    matches_sorted = sorted(matches, key=lambda x: abs(x[1][0] - x[1][1]))
    if len(matches_sorted) > 1:
        matches_sorted = [match for match in matches_sorted if (match[1][0] == matches_sorted[0][1][0] and match[1][1] == matches_sorted[0][1][1])]

    if len(matches_sorted) > 0:
        first_match = matches_sorted[0]
        VE_id = first_match[1][0]
        who_root_id = first_match[1][1]

        VE_span = Span(doc, VE_id, VE_id+1, label="VERB")
        who_root_span = Span(doc, doc[who_root_id].left_edge.i, doc[who_root_id].right_edge.i+1, label="WHO")

        idea_spans = []
        for match in matches_sorted:
            match_id, token_ids = match
            
            idea_root_id = token_ids[2]
            idea_spans.append(Span(doc, doc[idea_root_id].left_edge.i, doc[idea_root_id].right_edge.i+1, label="OPINION"))


        doc.spans["sc"] = spacy.util.filter_spans([VE_span, who_root_span] + idea_spans)
    else:
        doc.spans["sc"] = []
    return doc
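The match selection step inside generate_doc can be read in isolation: sort the matches by the token distance between the VE verb and its subject, then keep every match that shares the closest (VE, who_root) pair, so that several opinion clauses attached to the same verb survive. Below is a minimal pure-Python sketch of that step. The helper name select_closest_matches and the sample token indices are hypothetical.

```python
from typing import List, Tuple

# (match_id, [VE index, who_root index, idea_root index]), the shape
# returned by DependencyMatcher for the three-node pattern above.
Match = Tuple[int, List[int]]


def select_closest_matches(matches: List[Match]) -> List[Match]:
    """Keep the matches that share the closest (VE, who_root) pair."""
    if not matches:
        return []
    # Smallest distance between the verb and its subject comes first.
    ranked = sorted(matches, key=lambda m: abs(m[1][0] - m[1][1]))
    best_ve, best_who = ranked[0][1][0], ranked[0][1][1]
    return [m for m in ranked if m[1][0] == best_ve and m[1][1] == best_who]


# Hypothetical matches for illustration.
matches = [
    (1, [10, 2, 15]),  # verb at 10, subject at 2: distance 8
    (1, [5, 4, 8]),    # verb at 5, subject at 4: distance 1
    (1, [5, 4, 12]),   # same verb and subject, second opinion clause
]
selected = select_closest_matches(matches)
print(selected)
```

Choosing by distance is a heuristic: it assumes the subject closest to a VE verb is the most reliable speaker, which may not hold for every sentence.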

Designing what streamlit displays

import spacy_streamlit
import streamlit as st

DEFAULT_TEXT = """媒體關注數位部部長唐鳳對數位中介服務法看法，唐鳳表示，監理業務不屬於數位部範圍，面對大型跨境數位平台，最重要的是要確保現實世界中覺得合理的價值，平台上也應該符合相關的社會價值，遵守常規。"""


st.title("My cool app")
text = st.text_area("Text to analyze", DEFAULT_TEXT, height=200)
doc = generate_doc(text)

spacy_streamlit.visualize_spans(doc, spans_key="sc")
spacy_streamlit.visualize_parser(doc)

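Code walkthrough
- st.title("My cool app") sets the page title.
- text = st.text_area("Text to analyze", DEFAULT_TEXT, height=200) reads the user's text input and stores it in text; the default value is DEFAULT_TEXT.
- spacy_streamlit.visualize_spans(doc, spans_key="sc") visualizes the spans stored in the doc under the key "sc". Later this can be changed to show spans produced by different opinion extraction methods, for example spans stored under keys such as "rule_0" and "rule_1", in order to compare the results of each method.
- spacy_streamlit.visualize_parser(doc) visualizes the doc's dependency parse.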
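The comparison idea mentioned in the walkthrough could look roughly like the sketch below: each extraction method writes its spans to its own key in doc.spans, and visualize_spans is called once per key. The keys "rule_0" and "rule_1" come from the text above, while the helper names, the sample tokens, and the span boundaries are assumptions for illustration. spacy_streamlit is imported inside the rendering function so that the doc-building part can be run without Streamlit installed.

```python
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab


def build_comparison_doc() -> Doc:
    """Build a doc with two span groups, one per (hypothetical) rule."""
    words = ["唐鳳", "表示", "業務", "不", "屬於", "數位部"]
    doc = Doc(Vocab(), words=words, spaces=[False] * len(words))
    # Output of a first, hypothetical extraction rule.
    doc.spans["rule_0"] = [
        Span(doc, 0, 1, label="WHO"),
        Span(doc, 2, 6, label="OPINION"),
    ]
    # Output of a second, hypothetical rule with different boundaries.
    doc.spans["rule_1"] = [
        Span(doc, 0, 1, label="WHO"),
        Span(doc, 1, 2, label="VERB"),
        Span(doc, 4, 6, label="OPINION"),
    ]
    return doc


def render_comparison(doc: Doc, keys=("rule_0", "rule_1")) -> None:
    """Show one span visualization per key; call this in a Streamlit script."""
    import spacy_streamlit  # lazy import: only needed when rendering

    for key in keys:
        spacy_streamlit.visualize_spans(doc, spans_key=key, title=f"Spans: {key}")


doc = build_comparison_doc()
# render_comparison(doc)  # uncomment when running via `streamlit run`
```

In the real app, generate_doc would fill these keys from the different matchers instead of the hard-coded spans used here.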

Running streamlit

Run the following command in a terminal:

streamlit run <file name of the code above>.py

http://localhost:8501/ then opens automatically, and you can see the result!
